cyclistc is a bike sharing company based in chicago. Based on collected data stakeholders are interested in finding out how to convert casual riders into paying annual members. For easy conversion understanding behavours of each group on while using cyclistics is the one of the 3 focus of the team. which I was assinged to find out how casual riders differ from annual members in using thier services. which will be my business task for this project.
1, cyclistic: bike sharing company 2, Cyclistic exeecutive team: approver of recommended marketing program. 3, Lily Moreno: director of marketig and manager 4, Cyclistic Marketing analytics team: source, collector and analysiers of data that helps guide cyclistic’s marketing strategy.
Data used for this analysis was pre-collected for me by the cyclistic marketing analytics team. I have downloaded and stored the data approprately and created a dublicate of orgnal data before beging analysis.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/kiruk/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\kiruk\AppData\Local\Temp\RtmpOMM2fS\downloaded_packages
installed.packages("lubridate", repos = "http://cran.us.r-project.org")
## Package LibPath Version Priority Depends Imports LinkingTo Suggests
## Enhances License License_is_FOSS License_restricts_use OS_type Archs
## MD5sum NeedsCompilation Built
installed.packages("ggplot", repos = "http://cran.us.r-project.org")
## Package LibPath Version Priority Depends Imports LinkingTo Suggests
## Enhances License License_is_FOSS License_restricts_use OS_type Archs
## MD5sum NeedsCompilation Built
installed.packages("plotly", repos = "http://cran.us.r-project.org")
## Package LibPath Version Priority Depends Imports LinkingTo Suggests
## Enhances License License_is_FOSS License_restricts_use OS_type Archs
## MD5sum NeedsCompilation Built
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.1
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.1
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
divvy_trip_01 <- read.csv(file.choose())
divvy_trip_02 <- read.csv(file.choose())
divvy_trip_03 <- read.csv(file.choose())
divvy_trip_04 <- read.csv(file.choose())
divvy_trip_05 <- read.csv(file.choose())
divvy_trip_06 <- read.csv(file.choose())
divvy_trip_07 <- read.csv(file.choose())
divvy_trip_08 <- read.csv(file.choose())
divvy_trip_09 <- read.csv(file.choose())
divvy_trip_10 <- read.csv(file.choose())
divvy_trip_11 <- read.csv(file.choose())
divvy_trip_12 <- read.csv(file.choose())
It is crucial to check all data frames have similar column names before merging.
trips<- bind_rows(divvy_trip_01,divvy_trip_02,divvy_trip_03, divvy_trip_04,divvy_trip_05,divvy_trip_06,divvy_trip_07,divvy_trip_08,divvy_trip_09,divvy_trip_10,divvy_trip_11, divvy_trip_12)
Using bind_rows function rows from each individual dataframes will merge into one large data frame called trips.
skim(trips)
| Name | trips |
| Number of rows | 5667717 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5667717 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4745862 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4758633 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 833064 | 1675 | 0 |
| start_station_id | 0 | 1 | 0 | 44 | 833064 | 1314 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 892742 | 1693 | 0 |
| end_station_id | 0 | 1 | 0 | 44 | 892742 | 1318 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 | ▇▁▁▁▁ |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 | ▇▁▁▁▁ |
| end_lat | 5858 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 | ▁▁▁▁▇ |
| end_lng | 5858 | 1 | -87.65 | 0.11 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 | ▇▁▁▁▁ |
Merged data frame trips has empty colmun values in start_station_name, start_station_id, end_station_name, and end_station_id which are 833064,833064,892742, and 892742 respectivly. In addition to empty columns, n_missing are 5858 in both end_lat and end_lng which is affecting our complete_rate.
trips<- replace(trips, trips=='', NA)
Using replace function empty cells in the data frame will be assigned NA value. Doing so will make our data frame have 0 empty and all cells with NA Will be counted as n_missing instead.
trips<- trips[complete.cases(trips),]
Complete.cases will remove all values stored as NA thus giving us a data frame with no missing or empty values.
trips_02 <- trips %>%
distinct() %>%
unique()
since I am removing data I Will create a new version called trips_02. Code chunk will remove dublicates. It’s important to consult with stakeholders about missing values before removing as it can be filled to make it complete. But on this case missing values Will be removed.
trips_02<- trips_02 %>%
select(ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, member_casual)
The function Select will pick out selected columns out of trips_02 data frame to create a data frame that focus on stakeholders interest of answering user behavior difference. The removed columns dont aide in answering this question thus removed in the new version of trips data frame.
colnames(trips_02)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
A check to see data frame has all selected columns.
trips_02$trip_duration <- difftime(trips_02$ended_at,trips_02$started_at, units = "secs")
creating a new column named trip_duration that is the difference between trip ending and trip beginning. This will answer the question of trip duration among user types.
str(trips_02)
## 'data.frame': 4369360 obs. of 10 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
## $ ended_at : chr "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ member_casual : chr "casual" "casual" "member" "casual" ...
## $ trip_duration : 'difftime' num 177 261 261 896 ...
## ..- attr(*, "units")= chr "secs"
trips_02$trip_duration <- as.numeric(as.difftime(trips_02$trip_duration))
Converting trip_duration to numeric value from difftime. This change will enable calculations on trip_duration.
is.numeric(trips_02$trip_duration)
## [1] TRUE
checking our conversion was successful.
trips_02$date <- as.Date(trips_02$started_at)
trips_02$month <- format(as.Date(trips_02$date),"%m")
trips_02$day <- format(as.Date(trips_02$date),"%d")
trips_02$year <- format(as.Date(trips_02$date), "%Y")
trips_02$day_of_week <- format(as.Date(trips_02$date),"%A")
trips_02$hour_minutes <- format(as.POSIXct(trips_02$started_at), "%H:%M")
The above code chunk created new columns that will further assist in answering stakeholders issues. All columns created are derived from the started_at column and they are as.Date format. By dividing started_at column to month,day,year, day_of_week, and hour_minute it enables an enhanced data analysis and visualization.
table(trips_02$rideable_type)
##
## classic_bike docked_bike electric_bike
## 2597426 174858 1597076
table(trips_02$member_casual)
##
## casual member
## 1758189 2611171
trips_03<- trips_02[!(trips_02$trip_duration <0 | trips_02$rideable_type == "docked_bike"),]
Table(trips_02$rideable_type) reveals that there are 3 types of rides and docked_bikes takes 4 % of the total rides taken out of the year. Its a minority of rides that are classified as docked_bikes. In addition arranging trip_duration from highest to lowest reveals that rides classified as docked_bikes go over a month in duration of rides which is significant compared to a normal bike ride. when removing docked_bikes ride duration drop to less than 24 hour. Correction procedures could include asking data collectors for correction, explanation or removal. For this analysis removal is the only option as the other options are unavailable. The above code chunk will remove all 174858 rides labeled as docked_bike and trips with negative values in the trip_duration column.
table(trips_03$rideable_type)
##
## classic_bike electric_bike
## 2597398 1597035
colnames(trips_03)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
## [10] "trip_duration" "date" "month"
## [13] "day" "year" "day_of_week"
## [16] "hour_minutes"
str(trips_03)
## 'data.frame': 4194433 obs. of 16 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
## $ ended_at : chr "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ member_casual : chr "casual" "casual" "member" "casual" ...
## $ trip_duration : num 177 261 261 896 362 ...
## $ date : Date, format: "2022-01-13" "2022-01-10" ...
## $ month : chr "01" "01" "01" "01" ...
## $ day : chr "13" "10" "25" "04" ...
## $ year : chr "2022" "2022" "2022" "2022" ...
## $ day_of_week : chr "Thursday" "Monday" "Tuesday" "Tuesday" ...
## $ hour_minutes : chr "11:59" "08:41" "04:53" "00:18" ...
skim(trips_03)
| Name | trips_03 |
| Number of rows | 4194433 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 4194433 | 0 |
| rideable_type | 0 | 1 | 12 | 13 | 0 | 2 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 3653923 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 3665869 | 0 |
| start_station_name | 0 | 1 | 7 | 64 | 0 | 1556 | 0 |
| start_station_id | 0 | 1 | 3 | 44 | 0 | 1270 | 0 |
| end_station_name | 0 | 1 | 10 | 64 | 0 | 1598 | 0 |
| end_station_id | 0 | 1 | 3 | 44 | 0 | 1281 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| month | 0 | 1 | 2 | 2 | 0 | 12 | 0 |
| day | 0 | 1 | 2 | 2 | 0 | 31 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| day_of_week | 0 | 1 | 6 | 9 | 0 | 7 | 0 |
| hour_minutes | 0 | 1 | 5 | 5 | 0 | 1440 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2022-01-01 | 2022-12-31 | 2022-07-21 | 365 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| trip_duration | 0 | 1 | 941.7 | 1601.01 | 0 | 355 | 616 | 1085 | 89965 | ▇▁▁▁▁ |
A final check for data frame completeness and any issues to be cleaned before continuing on to the analysis phase.
min(trips_03$trip_duration)
## [1] 0
max(trips_03$trip_duration)
## [1] 89965
mean(trips_03$trip_duration)
## [1] 941.7029
median(trips_03$trip_duration)
## [1] 616
Using min, max, mean and median will indicate our data boundary on the trips_duration, average and median data.
aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = min)
## trips_03$member_casual trips_03$trip_duration
## 1 casual 0
## 2 member 0
aggregating data to specific column in this case to member_casual shows what the minumum trip duration is among member types.
aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = max)
## trips_03$member_casual trips_03$trip_duration
## 1 casual 89965
## 2 member 89594
aggregating data to specific column in this case to member_casual shows what the maximum trip duration is among member types.
aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = mean)
## trips_03$member_casual trips_03$trip_duration
## 1 casual 1262.5937
## 2 member 747.1275
aggregating data to specific column in this case to member_casual shows what the average trip duration is among member types.
aggregate(trips_03$trip_duration~trips_03$member_casual, FUN = median)
## trips_03$member_casual trips_03$trip_duration
## 1 casual 776
## 2 member 539
aggregating data to specific column in this case to member_casual shows what the median trip duration is among member types.
aggregate(trips_03$trip_duration~trips_03$member_casual+trips_03$day_of_week, FUN = mean)
## trips_03$member_casual trips_03$day_of_week trips_03$trip_duration
## 1 casual Friday 1194.1929
## 2 member Friday 733.6164
## 3 casual Monday 1276.5298
## 4 member Monday 721.9727
## 5 casual Saturday 1415.5713
## 6 member Saturday 838.8530
## 7 casual Sunday 1434.4400
## 8 member Sunday 831.3002
## 9 casual Thursday 1128.0301
## 10 member Thursday 721.9113
## 11 casual Tuesday 1133.0574
## 12 member Tuesday 707.4686
## 13 casual Wednesday 1094.0147
## 14 member Wednesday 710.8143
aggregating data to specific column in this case to member_casual and day_of_week shows what the average trip duration is among member types in a specific day of the week. However, inthis case the days are out of order and a bit hard to read.
trips_03$day_of_week <- ordered(trips_03$day_of_week,levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
using ordered function we can rearrange days of the week from sunday to saturday making it easy to read.
aggregate(trips_03$trip_duration~trips_03$member_casual + trips_03$rideable_type, FUN=mean)
## trips_03$member_casual trips_03$rideable_type trips_03$trip_duration
## 1 casual classic_bike 1466.8318
## 2 member classic_bike 794.0642
## 3 casual electric_bike 1001.2371
## 4 member electric_bike 658.2664
aggregating data to specific column in this case to member_casual and rideable_type indicates what the average trip duration is among member types for each type of ride types in the data frame.
trips_04 <- trips_03 %>%
mutate(weekday=wday(started_at, label=TRUE)) %>%
group_by(weekday,member_casual) %>%
summarise(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
arrange(member_casual,weekday)
## `summarise()` has grouped output by 'weekday'. You can override using the
## `.groups` argument.
This new dataframe consists of 4 columns with 14 rows which arranges the average trip duration for each rides along side membership type per each day of the week.
trips_05 <- trips_03 %>%
mutate(weekday=wday(started_at, label=TRUE)) %>%
group_by(member_casual,weekday,rideable_type) %>%
summarise(number_of_rides = n(), average_duration = mean(trip_duration)) %>%
arrange(member_casual,weekday)
## `summarise()` has grouped output by 'member_casual', 'weekday'. You can
## override using the `.groups` argument.
This new data frame consists of 5 columns with 28 rows. This data frame is identical to the previous data frame but this one has one more variable which is rideable_type. which arranges the average trip duration for each rides along side membership type per each day of the week by ride type.
trips_06 <- trips_03 %>%
mutate(weekday=wday(started_at,label = TRUE)) %>%
group_by(member_casual,rideable_type) %>%
summarise(number_of_rides = n(),
average_duration = mean(trip_duration)) %>%
arrange(member_casual,rideable_type)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
trips_06$number_of_rides <- as.numeric(as.character(trips_06$number_of_rides))
trips_06$average_duration <- as.numeric(as.character(trips_06$average_duration))
class(trips_06$number_of_rides)
## [1] "numeric"
class(trips_06$average_duration)
## [1] "numeric"
This new data frame consists of 4 columns and 4 rows. The data frame provides the total number of rides alongside average trip duration by membership type and ride type.
trips_07 <- trips_03 %>%
select(ride_id,rideable_type) %>%
group_by(rideable_type) %>%
summarise(number_of_rides = n())
This new data frame has 2 columns and 2 rows. The data frame provides the total number of rides by ride type.
trips_08 <- trips_03 %>%
select(ride_id,hour_minutes,member_casual) %>%
group_by(hour_minutes,member_casual)%>%
summarise(number_of_rides = n())%>%
arrange(hour_minutes)
## `summarise()` has grouped output by 'hour_minutes'. You can override using the
## `.groups` argument.
This data frame has 3 columns and 2880. It arranges the yearly ride totals by membership type and each minute of a day. This will help in visualizing peak ride hours in the day by membership type for the total annual rides taken in the data frame.
x <- trips_07$number_of_rides
label <- trips_07$rideable_type
piepercent <- round(100*x/sum(x),1)
pie(x, labels = piepercent, main = "Percentage of rides by ride type", col = c("lightblue","orange"))
legend("bottomright", c("classic_bike","electric_bike"), cex = 1, fill = c("lightblue","orange"))
ggplot(trips_06)+
geom_point(mapping = aes(x=member_casual, y=number_of_rides, color=rideable_type))+
scale_y_continuous(breaks = seq(600000,2000000, by =100000))+
xlab("Casual riders VS Members")+
ylab("total number of rides taken")+
labs(title="Casual riders VS members annual rideship by ride type")
ggplot(trips_06) +
geom_point(mapping = aes(x=average_duration, y=number_of_rides, color = member_casual, shape = rideable_type)) +
scale_y_continuous(breaks = seq(600000,2000000, by=100000))+
xlab("Average duration of trips by second")+
ylab("Total number of rides taken")+
labs(title = "Average trip duration by ride type and membership")
ggplot(trips_04, mapping = aes(x=weekday,y=number_of_rides, fill=member_casual))+
geom_col(position = "dodge")
ggplot(trips_05)+
geom_point(mapping = aes(x=weekday, y=number_of_rides, color= member_casual, shape=rideable_type))
ggplot(trips_04, mapping = aes(x=weekday,y=average_duration, fill=member_casual))+
geom_col(position = "dodge")+
ylab("Average trip duration by second")
ggplot(trips_05)+
geom_point(mapping = aes(x=weekday, y=average_duration, color= member_casual, shape=rideable_type))
plot2<- ggplot(trips_08, mapping = aes(x=hour_minutes, y=number_of_rides))+
geom_col()+
scale_x_discrete(breaks=c("12:00","06:00","18:00","00:00","23:59","23:59"))+
theme(
axis.title = element_text( color="red", size=10, face=2),
axis.line = element_line(linewidth = 3, colour = "lightgreen", linetype=1),
axis.text = element_text( angle = 90, color="blue", size=7, face=2)
)+
facet_wrap(~member_casual)+
labs(title = "Number of rides for each hour of the day in a year.")
ggplotly(plot2)